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Abstract: Covariance tapering is a popular approach for reducing the computational 
cost of spatial prediction and parameter estimation for Gaussian process models. How¬ 
ever, tapering can have poor performance when the process is sampled at spatially 
irregular locations or when non-stationary covariance models are used. This work 
introduces an adaptive tapering method in order to improve the performance of ta¬ 
pering in these problematic cases. This is achieved by introducing a computationally 
convenient class of compactly supported non-stationary covariance functions, com¬ 
bined with a new method for choosing spatially varying taper ranges. Numerical 
experiments are used to show that the performance of both kriging prediction and 
parameter estimation can be improved by allowing for spatially varying taper ranges. 
However, although adaptive tapering outperforms regular tapering, simply dividing 
the data into blocks and ignoring the dependence between the blocks is often a better 
method for parameter estimation. 

Key words: Kriging; Sparse matrices; Compactly supported covariances; non- 
stationary covariances; Maximum likelihood 


1 Introduction 


Gaussian processes are important for statistical analysis of spatial data. In applications the 
goal is often to predict the process at unobserved locations, which is done by computing the 
conditional mean of the field given the data. This is referred to as the kriging prediction in 
geostatistics, and requires solving where x is a vector with observed values of the field 

and S is the covariance matrix for the field at the observation locations. Thus, if the field is 
observed at N locations, the computational cost for kriging prediction is in general 0{N^). This 
limits the applicability for large datasets and is commonly referred to as the “big N” problem. 

Covariance tapering is a popular method for handling the big N problem. The basic idea 
of this method is to set small elements in S to zero, which enables the use of computationally 
efficient sparse matrix techniques for computing 5]~^x. For spatial problems, this typically 
reduces the computational complexity from 0{N^) to The simplest way to introduce 

zeros in S is to replace the covariance function, r(h), of the Gaussian process with some 


compactly supported covariance function r(h), such as a Wendland function (Wendland 1995 


Gneiting, 20021. An alternative is to replace r(h) with a tapered version riap(h) = r(h)T(h). 


The covariance range of T'(h), referred to as the taper range, determines the sparsity of the 
resulting covariance matrix. 

Several authors have studied the effect of tapering in the case when r(h) is a Mat&n 


covariance function (Mat&n 19601. Kaufman et al. (20081 showed that certain parameters of 


the Matern covariance function can be consistently estimated using tapering, and Du et al. 


(2009), Shaby and Ruppert (2012), and Wang et al. (2011) further studied the asymptotic 


properties of tapered estimators in the case of Matern covariance models. Furrer et al. (2006) 


showed that tapering can be used with asymptotically negligible loss in the case of kriging 
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prediction for Matern fields, and Stein (20131 extended some of these results beyond the Mat&n 
model. 

However, asymptotic results do not guarantee that the method is practically useful. Stein 


(2013) used numerical experiments to show that covariance tapering often does not work as 
well for parameter estimation as simply splitting the observations into blocks and ignoring the 
dependence between the blocks. Furthermore, Bolin and Lindgren (2013) showed that tapering 
also can perform poorly for kriging prediction compared with other methods for handling the 
big N problem. A reason for this is that the computational cost for stationary tapering depends 
on the ratio of the taper range and the average spacing of the observations, whereas the accuracy 
depends on the ratio of the taper range and the range of r(h). The accuracy also depends on 
how the observations are located spatially, and more sparsely observed areas will have higher 
errors for tapered kriging predictions. This often leads to higher errors close to the boundaries 
of the observation domain as well as “spotty” predicted surfaces where sparsely observed regions 
are biased towards zero. 


Stein (2013) proposed using a simple deformation method to improve performance near 
the boundaries of the domain. However, this method only works for rectangular domains 
and does not help in sparsely observed regions. 


Anderes et al. (2013) proposed a more 


sophisticated deformation method based on quasi-conformal maps where an adaptive taper 
T(s,t) = T(||(/?(s) — 99(t)||) was defined using a stationary taper T and a warping function 
(^. A drawback with this approach is that estimation of the warping function ip is highly 
computationally demanding, which limits the usefulness of the method. 

This works introduces a new adaptive taper method that is more flexible than that by 


Stein (2013) while at the same time being more computationally efficient than the method by 


Anderes et al. (2013). This is achieved by first introducing a computationally convenient class 


of compactly supported non-stationary covariance functions in Section How to use this for 
kriging prediction is then discussed in Section The section first introduces a new method 
for choosing spatially varying taper ranges and then investigates the accuracy of the method 
using simulated data. Section presents a method for using the adaptive tapers for parameter 
estimation and investigates the accuracy of the method using numerical experiments. The 
conclusion of these comparisons is that adaptive tapering outperforms regular tapering, but 
simply dividing the data into blocks and ignoring the dependence between the blocks is often a 
better method for parameter estimation. This is in line with what Stein (2013) found. Finally, 
Section [^contains comments and suggestions for future research. 


2 Adaptive covariance tapers 


A popular method for constructing non-stationary covariance models is to use the process 


convolution approach ( 

Barry and Ver Hoef 

1996 

Higdon 

2001 Cressie and Ravlicova 

2002 

Rodrigues and Diggle 

2010), where a Gaussian stochastic field, Ai(s), is defined as the con- 


volution of a Brownian sheet B and some convolution kernel ks{u). The covariance function 
of A(s) is then given by the integral (^(s, t) = f ks(u)kt(u) du, which is non-stationary if the 
kernel changes spatially. One can also note that the support of the covariance function is de¬ 
termined by the support of the kernel. The idea here is therefore to use compactly supported 
and non-stationary kernels in the process-convolution approach to construct adaptive tapering 
functions. Thus, we choose the tapering function T(s,t) as 

T{s,t) = j ksiu)ktiu)du. (1) 

The advantage with this approach is that T(s,t) by construction is a valid covariance func¬ 
tion for any square integrable A:s(u). The disadvantage, however, is that numerical methods 
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Figure 1: The value of the non-stationary covariance function T 2 (s,t) is given by the area of 
the shaded region normalized by the areas of the two circles. 


often are required to evaluate the integral in Q. This is computationally expensive and there¬ 
fore reduces usefulness of the construction. A crucial property of the tapers we construct below 
is that the integral can be solved analytically, which means that one has an explicit form of the 
non-stationary taper function. 


2.1 Hyperspherical tapers 

The perhaps simplest choice of A:s(u) is the indicator function 


A:2,s(u) 


1 

0{s)^ 


I 



1 

6(s)0r 

0 


if ||s — u|| < 0(s)/2, 
otherwise. 


( 2 ) 


The normalization of the kernel is chosen so that the resulting taper satisfies r(s,s) = 1. In 
this parameterization, the taper range of a stationary model with 0(s) = 6 is given by 0. The 
value of T(s, t) is in this case obtained as the normalized area of the asymmetric lens produced 
by the intersection of two circles centered at s and t, with diameters 0[s) and 0(t) respectively. 
See Figure for an illustration. 

Straightforward calculations give that the resulting taper is 


72(s,t) = 


1 


T^TstRst 


V2 




? 


2dst 


+ V2 (^rst, 




2ds 


dst ^ Rst ’^st) 

Rst ^ dst Rst T '^stl 
otherwise. 


Here dgt = ||s — t||, Rgt = h max(0(s), d(t)), and rgt = h min(0(s), d(t)). Further, 


f cos ^ (f) — xVr"^ — |x| < r 

0 otherwise. 


(3) 


is the area of a circular cap with triangular height x of a circle with radius r. The case 
dst < Rst — fst does not occur if the taper range 9{s) is Holder continuous with exponent 1. 
This can be a natural condition to require in applications where the taper range should vary 
smoothly across space. 

With 0{s) = 6, the taper function reduces to 


T{d) 



cos-\l) 





for d < 6, 
otherwise. 
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and it is easy to show that this is a valid covariance function for processes on M and on but 
not for any M"* with m > 2. 

A variant of the taper is obtained if s and u are interpreted as vectors in The normalized 
kernel is then 


A; 3 ,s(u) 


%/3 


y^47r0(s)" 




(4) 


and the expression for the resulting taper is obtained as the normalized volume of the intersec¬ 
tion of two spheres in 




47rr' 


3/2 


vAr 


^sti 


dt+Rt-r: 


2dst 


+ Vs (r,t, 


di+ri-R: 


2d. 


“) 


dst < Rst "^sti 

Rst '^st ^ dst ^ Rat T '^stj 

otherwise. 


Here, Vs^r, x) denotes the volume of the spherical cap with triangular height x of a sphere with 
radius r, which is given by 


V3(r, x) 


|(r — x)^(2r -|- x) 

0 


|x| < r, 
otherwise. 


With 9{s) = 9, the taper simplifies to 


T{d) 



{29 + d){9-d)^ 


if d < 9, 
otherwise, 


which is a valid covariance function on for m < 3 and is commonly referred to as the 
spherical covariance function. 

By the same reasoning on can create a taper function by defining the generating kernel as 
a normalized hypersphere in M"'. The taper is then obtained as the normalized volume of the 
intersection of the two hyperspheres, which for a general n is 


T„(s,t; 


^ r(n/2+i) 


n ./2 


r(n/2+l) 
VnfRaU^ 


d'it+R'it-r'^.t 


2dst 


) + 14 (r-st, ■ 


d'if+rit-Rl 


2dst 


0 


If dst ^ Rst ^st ; 

if Rst st ^ dst ^ Rst T Xgt • 

otherwise. 


Here r(x) is the gamma function and 14 denotes the volume of a hyperspherical cap with 
triangular height x, of a hypersphere with radius r, which is given by 


14 (r, x) 


Tj-n/2j.n 

2T{n/2 + l) 


' T / n+1 l'\ 

-'l-(a:/r)2 f 2 ’2/ 

< 1 - h-ixivY 2 ) 

0 

\ 


0 < X < r, 

—r < X < 0, 

otherwise. 


Here, 4 is the regularized incomplete beta function, defined as 


n + 1 ^ /oVt»-i(l-^)~^dt 

2 2 / — t)~^dt 


See Li (2011) for a derivation of the volume of a hyperspherical cap. 

By construction, r„(s,t) is a valid covariance function for M™', m < n. Choosing 9{s) = 9 
results in the stationary and isotropic correlation function 


Tn{d) = In 



(5) 
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Figure 2: The stationary hyperspherical covariance functions for 0 = 1 for different values of n. 


which is shown for different values of n in Figure This function is sometimes referred to 


as Euclid’s hat and is frequently used in spatial statistics (see e.g. 

Matern 


960 Chapter 3) 

The function is continuous at the origin and has n — 1 derivatives at d = 9 

(Gneiting 

1999) 


Furthermore, Euclid’s hat with n = m is the solution to Turan’s problem, which means that 


it is the correlation function with the maximal integral over its support (see Ehm et al. 2004 


Section 4.4). This is important for tapering because one then wants to taper low values to zero 
while keeping large value relatively unaffected. Thus, these tapers are both simple to construct 
and have several desirable properties. 


2.2 Smooth tapers 

The differentiability at the origin of the taper function is an important property. When using 
a compactly supported covariance function for tapering, one generally wants it to has as many 
derivatives at the origin as the original covariance function ( ]Furrer et al. 20061. A drawback 
with the family Tn is therefore that all functions in the class have the same differentiability at 
the origin. 

To obtain a smoother taper one can replace the spherical kernel with some differentiable 
function in Q. For example, one could define an m times differentiable kernel as the convolu¬ 
tion of m spherical kernels, A:™s(u) = A:n,s(u) * ... * A:n,s(u). However, the problem with this 
construction, and most other convolution-based approaches, is that the integral in Q in general 
is not available in closed form. We believe that one could compute the taper analytically if 
/c((’'g(u) is used as a kernel, but it is not clear whether one can find some compact and simple 
expression for it. 

An alternative is to do the integration numerically. However, as soon as numerical integra¬ 
tion is required the computational cost for forming the taper matrix increases and can easily 
outweigh the cost of, for example, computing the kriging predictor using a full covariance ma¬ 
trix. We have not been able to find any non-stationary tapers based on isotropic kernels which 
are as easy to evaluate as the family, and will not pursue this issue further here. See for 


example Mateu et al. (2013) for other kernel-based approaches to constructing non-stationary 


compactly supported covariance functions. 

A simpler alternative for constructing a more differentiable non-stationary covariance func¬ 
tion is to drop the requirement that the kernel should be isotropic. One can then construct the 
kernel as a product fc™(u) = Y[i=i where k'^{u) = {u) . .*ksi{u) is the convolution 
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of m spherical kernels on M. The taper is then given by 




k^in)kr{u)du = l[ / kTM)kuiu^)du^ = l[T^is„U). 


( 6 ) 


2=1 ' 


2=1 


The spherical kernel on M is simply an indicator function and A:™ is therefore a scaled and 
shifted cardinal B-spline. Thus, to evaluate T™‘(si,ti) one has to compute the integral of the 


product of two B splines, which can be done using integration by parts (Vermeulen et al. 1992) 
For m = 1 one has kl{u) = I{\u — s| < 6{s)/2) and 


T\s,t) 


1 j |s — t|, if 2|s — t| < 6{s) + 9{t) 
^J9{s)6{t) 1^0 otherwise. 


This is the one-dimensional version of the Hyperspherical taper Ti{s), and if 9{s) = 0 it is the 
triangular covariance function. For m = 2, the kernel instead is 


ksiu) 


V3 

vW) 


1 o ^ 

^ ^ ^e{s) 


1 - 2 


e{s) 


if s — 9{s) < u < s, 
if s < u < s + 9{s), 
otherwise. 


which gives the taper 




1 

4(r/i;)3/2 


2d^ — 6d^r + 2r^(3i? — r) 

3[d^ — {d^ + rR){R -1- r) -t- d{r — i?)^] — R^ — 
6r'^{R — d) 

3[{rR — d^){r -|- i?) -|- d{R — r)^] -|- 

3[(d^ -|- rR){r + R) — d{r + R)^] + R^ + — d^ 


0 


d < min(r, R — r), 

R — r < d < r, 
r < d < R — r, 
max(r, R — r) < d < R, 
R < d < R + r, 
otherwise. 


Here, d 
to 


s—t\, r = ^ min(0(s), 9{t)), and R = ^ max(0(s), d(t)). If 9{s) = 9, T^{s, t) simplihes 


T\d) 


1 

w 


'9^ + 6d^-69d^ 

< 2{9^ - 39^d + 39d^ 
0 

\ 


0 < d < 9/2, 
d^) 9/2 <d< 9, 
otherwise. 


3 Tapering and kriging prediction 


Let X(s) be a mean-zero Gaussian random held with some covariance function r(s, t). Suppose 
that X is observed at the locations si,...,sjv and let xq = (X(si),..., X(sjv))'''. Further, 
let xi = (X(s^),..., X(s*be a vector with values that are to be predicted. The joint 
distribution of xq and xi can then be written as 




The conditional distribution of xi given xq is xi|xo N(5]io5]ooXo,Sn - JlioEgoiEoi). The 
mean xi = Xlio^lgQ^xo is the kriging predictor, and the MSE of the predictor is given by 
the diagonal of the covariance matrix, MSE(x) = diag(Sii — 5]io5]|jg^5]oi)- As previously 
mentioned, solving Sqq^xo is computationally expensive, and an alternative is to use a tapering 

estimate xi = ^lo^oo^^O; where the tilde indicates that the covariance matrices are based on 
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the tapered covariance r(s, t)T(s, t). That is, 'Sqq = 5]oo o T, where Tij = T(sj,Sj) and o 
denotes the element-wise product. The MSE of the tapered prediction is given by 

MSE(x) = diag(Sii — flio^oo ^oi — ^lo^oo ^oi + ^io^qo ^Ioo^qo Sqi), (7) 

which reduces to the MSE of the kriging predictor if T(s,t) = 1. The MSE of the tapered 
prediction is a natural measure of accuracy which we will use to compare different methods in 
what follows. 

3.1 Using adaptive tapers 

When tapering is used for kriging, the stationary taper range is usually set so that Hqo has 
a fixed percentage of non-zero values. The goal with adaptive tapering is to achieve better 
predictions without increasing the number of non-zero values in 'Sqq. To do this, the local 
taper ranges 9{s) of the adaptive taper have to be chosen. The best choice of 9{s) would be 
a function that minimizes the kriging error while keeping the number of non-zero elements in 
Hoo fixed. This is however difficult to do in practice, so we will instead split the problem in 
two parts. Since the sparsity of Sqo only is affected by the values of 0(s) at the measurement 
locations si,..., s^v, we start by choosing Oi-jy = (0(si), 0 (s 2 ),..., 9{sj\f)) in order to achieve 
a given sparsity. As a second step, we interpolate the values 6ij\f to obtain the continuous 
function 9{s) which is used to calculate Hio- The details of these two steps are discussed in 
the following two subsections. 

3.1.1 Choosing the taper range at the measurement locations 

Ideally, Oijsr should be chosen so that every row in Sqo has the same number, m, of non-zero 
elements. In other words, one would like the tapered covariance between Ai(sj) and X at the 
m—1 points nearest to Sj to be non-zero for each i. In general, one cannot get exactly m elements 
for each row since the nearest-neighbor relation is not symmetric. However, Oi jsf can be chosen 
so that the average number per row is m while keeping the variation between the rows small. 
This can be achieved using Algorithm]^ where njiOi ■n) is the number of non-zero elements of 
row j in 'Sqq for a given choice of taper ranges and rj(Oi-]sf, k) denotes the /cth largest value 
of the vector (||sj — si ||, ||sj — S 2 II,..., ||sj — sjv||)'^ — Ov.n- The parameter mi:jv = (mi,..., uin) 
is a vector containing the desired number of non-zero elements for each row. The parameter e 
determines the percentage of nodes that is allowed to have \nj{6i-i\f) —mj\ > 1, and thus serves 
as a convergence criterium. Unless the measurement locations are structured in some malicious 
way, the algorithm will converge towards a point 9i-i^ that gives close to rrij non-zero elements 
for each row j. 

The algorithm works as follows: If Sqo has too many non-zero elements at iteration i, the 
taper range of the location which corresponds to the row with the most non-zero elements 
is decreased so that that row gets the correct number of non-zero elements. If the matrix 
instead has too few non-zero elements at iteration i, the taper range of the location which 
corresponds to the row with the fewest number of non-zero elements is increased so that that 
row gets the correct number of non-zero elements. In the case that Hqo has the correct number 
of non-zero elements, but the rows have different number of non-zero elements, the row with 
the largest deviation in the number of non-zero elements is changed. If several rows have the 
same deviation, one of these are picked at random at row 16. In row 17, random numbers 
are used since deterministic values can cause taper ranges for different points to be exactly 
equal, which can cause numerical problems. When the first loop has converged, the final loop 
goes through the nodes and maximizes the taper ranges while keeping the number of non-zero 
elements constant. 
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Algorithm 1 


1 : 

2 : 

3 : 

4 : 

5 : 

6 : 

7 : 

8 : 

9 : 

10 : 

11 : 

12 : 

13 : 

14 : 

15 : 

16 : 

17 : 

18 : 

19 : 

20 : 

21 : 

22 : 

23 : 


procedure THETA-SET(mi:7v, e) 

i ^ 0 

do 


i ■(— i + 1 

j _ fl(*“l) 

'^1:7V ^ '^l:Af 


rim ^ min;(ni(0^*]v) “ "^0 

if ^ 

if -m\=0 then 

break 

else if max;(ni(0^*^jv) - mi) > \ - mi)\ then 

Um ^ max/(n/(0^*]y) - mi) 

end if 

> Nm then 

(i) 


,{i) 


else if 


N) 


Um maxz(nz(0i/^) - mi) 


end if 


j random integer from the set {I : ni{6^^^) — mi = rim} 
9j ^ Unit [mj]),rj{e^^^^, [mj] + 1 

while {E«I(l”K^i:iv) “ "^il > 1) > ^4 

for J = 1,..., A do 




end for 
return 6 




(i) 

1:N 

end procedure 


The result of the algorithm for a simple example with ten locations can be seen in the left 
panel of Figure]^ The result is visualized through the circular kernels /cs(u) for each location, 
where the radius of the circle at location Sj is given by 9s 


3.1.2 Interpolating the taper ranges 

Running Algorithm results in a vector 0i:Ar that has to be interpolated in order to generate 
a continuous function 9{s). There are many ways to do this, and one of the simplest is to 
compute a Delaunay triangulation of the locations Si,..., s^r and do linear interpolation within 
each triangle: For a location s within a triangle 7~ with vertices S(j,Sfe,Sc and corresponding 
taper ranges 9a, 9b, 9c, the interpolated taper range 9{s) is given by 9{s) = Wa9a + Wb9b + Wc9c, 
where {wa,Wb,Wc} are the barycentric coordinates of s in T. The result of this interpolation 
method, implemented using the scatteredinterpolant function in [MATLAB (20151, for a 
simple example with ten locations can be seen in the right panel of Figure 

One could imagine that more advanced interpolation methods, resulting in smoother sur¬ 
faces, could improve the results when the method is used for kriging. However, initial studies 
indicate that the interpolation method has little effect on the resulting Kriging error. For 
example, the scatteredinterpolant function also supports a natural neighbor interpolation 
method that uses Voronoi tessellations to find a smoother interpolation [C^ except at the mea¬ 
surement locations). In order to investigate if the kriging results could be improved by using 
more sophisticated interpolation methods, a part of the simulation study in Section 3.2 


was 


done using both natural and linear interpolation. Specifically, the MSE of the kriging predictor 
using linear interpolation for one of the test cases using an exponential covariance function was 


8 
















Figure 3: The left panel shows ten observation locations and the circular kernels ks{u) for each 
observation location. The diameters 9{s) of the kernels are chosen using Algorithm so that 
each kernel overlaps with two other kernels, resulting in a tapering matrix with three non-zero 
elements per row. The right panel shows the taper range of the observation locations (the 
values in the circles) and an interpolated function which is used during kriging predictions of 
unobserved locations. 

2.87 (shown as 2.9 in Table [^. Changing to the natural interpolation method instead resulted 
in an MSE of 2.78. The mean time it took to compute the interpolation was 9.1ms for the linear 
interpolation and 12.6ms for the natural interpolation. Thus, it seems as if the results for the 
adaptive tapers could be improved slightly by using more complicated interpolation methods. 
However, since the difference is small, we use the simpler linear interpolation method for the 
rest of this work, and leave the question of how to optimally interpolate the taper ranges for 
future research. 


3.1.3 Examples using simulated data 


Results of using Algorithmj^in combination with linear interpolation for three realistic scenarios 
are shown in Figure The left panels show the measurement locations and the right panels 
show the corresponding taper ranges chosen so that the tapered covariance matrix has as many 
non-zero elements as if a stationary taper range of 0.1 was used. 

The top row shows a spatially structured case, where the measurements are taken at a 
perturbed grid. The measurements are done at the locations jj^(0.5 -\-i + Uij, 0.5 + j + Vij) for 
i,j = 0,... ,31, where Uij and Vij are uniform random variables on (—0.45,0.45). Thus, this 
scenario is similar to that used in Stein (20131. For this scenario, the taper range is larger at 


the edges of the domain but fairly constant in the interior, and the results are fairly similar to 


the warping method proposed by Stein (20131. The second scenario, shown in the middle row 


of Figure]^ is complete spatial randomness, where the measurements are taken at {Uij,Vij) 
where Uij and Vij now are uniform random variables on (0,1). Finally, the third scenario shown 
in the bottom row in the figure, corresponds to clustered locations, where the measurements 
locations are drawn from a log-Gaussian cox process. That is, the locations are drawn from a 
Poisson process with intensity A(s) = exp(Z(s)) where Z{s) is a mean-zero Gaussian Matern 
field, with covariance function 




\v>Q. 


( 8 ) 
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Here k is a range parameter, the variance, is a smoothness parameter, and Ki, is a modified 
Bessel function. The distance where the covariance is approximately one tenth of the variance, 
sometimes referred to as the practical range, of the field is given by p = For the 

scenario in Figure]^ the parameter k = 10, v = 0.5, and a = 2 were used. 

Compared with the spatially structured case one can note larger variations in the taper 
ranges in the two other scenarios. The warping method in Stein (20131 cannot be used to 
construct tapers similar to our adaptive tapers in these cases, but one would then have to 


use the more complicated and computationally expensive warping methods by Anderes et al. 

(2^. 

Histograms of the number of non-zero elements per row can be seen in Figure]^ The figure 
has one panel for each of the scenarios in Figure]^ and each panel shows the histogram for the 
adaptive taper in black and the histogram for the stationary taper in grey. The total number 
of non-zero elements is the same for the two methods, but the variability of the number of non¬ 
zero elements for each row is much smaller for the adaptive tapers, especially in the clustered 
scenario. The variability of the number of non-zero elements per row for the adaptive taper 
is mainly caused by the fact that the average number of non-zero elements per row is not an 
integer. For example, for the leftmost panel it was approximately 29.6, and in order to get the 
correct average the method needs to let the rows have different number of non-zero elements. 


3.2 Numerical comparisons of kriging prediction 


In this section, stationary and adaptive tapering approximations are compared in the case of 
kriging prediction for a Gaussian Matern field on [0,1]^ with known parameters. The field is 
observed at N locations in the domain and the value of the field is predicted for all locations 
on a 50 X 50 regular lattice in the domain. The comparisons are done both with v = 0.5, which 
corresponds to an exponential covariance function, and with v = l.b which corresponds to a 
smoother covariance function. For both values of v, one case with a long correlation range, 
p = 0.2, and one case with a shorter range, p = 0.1, are tested. For all cases, cr = 1 is used. For 
each parameter configuration, the three different measurement scenarios described in Section 
3.1.3 are tested, each with N = 1024 measurement locations. 

Five different tapering functions are compared. The first is a stationary Wendland function 


Tg{h)={i-^^y {i+A^^y{\h\<e), 

the second is a stationary hyperspherical taper (|^ with re = 2, and the third is the corresponding 
non-stationary hyperspherical taper T 2 . The final two are the non-stationary product tapers 
and T^, where the latter has the same differentiability at the origin as the Wendland function. 
In order to get comparable results, the various taper ranges have to be chosen so that the 
methods have similar computational costs. To do this, the taper ranges of the two stationary 
tapers are set to 0.1. Algorithm is then used to choose the non-stationary taper ranges so 
that all tapered covariance matrices have the same number of non-zero elements. 

In summary, four different parameter settings and three different measurement scenarios 
are tested. This gives twelve test cases in total on which all five tapering methods are tested. 
For each test case, 100 different data sets are simulated, and the optimal kriging predictor and 
the different tapered approximations are computed for each data set. 

The relative increase in the MSE, compared to the optimal kriging predictor, is used as a 
measure of accuracy 


1 

lOOre 


100 n 


EE 


MSE(4^')(s,))-MSE(4^'^(s,)) 

MSE(4^')(si)) 


(9) 
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Figure 4: The left panels show three types of measurement locations: Structured locations (top), 
completely spatially random locations (mid), and clustered locations (bottom). Each example 
has 1024 measurement locations and the right panels show the corresponding non-stationary 
taper range for each scenario. The taper range is chosen using the method in Section 3.1 


so 


that the covariance matrix for the data has as many non-zero elements as if a stationary taper 
range of 0.1 was used. 
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Figure 5: Histograms of the number of non-zero elements per row in the tapered covariance 
matrix for data for the three examples in Figure]^ Structured locations (left), random locations 
(middle), and clustered locations (right). The grey curves show the stationary tapers and the 
black curves show the adaptive tapers. 


Short range Long range 




S 

R 

C 


S 

R 

C 

W{h) 

3.1 

(0.06) 

5.9 (0.18) 

9.3 (0.42) 

6.3 

(0.11) 

12.3 (0.52) 

26.6 (2.15) 

T2{h) 

1.2 

(0.02) 

2.3 (0.08) 

4.7 (0.34) 

1.8 

(0.05) 

3.7 (0.22) 

12.5 (1.64) 

T 2 {s,t) 

1.1 

(0.02) 

2.0 (0.08) 

1.2 (0.33) 

1.5 

(0.03) 

2.9 (0.15) 

1.8 (0.58) 

T\s,t) 

1.6 

(0.02) 

2.5 (0.09) 

1.5 (0.37) 

2.8 

(0.04) 

4.0 (0.15) 

2.6 (0.65) 

T‘^{s,i) 

3.3 

(0.06) 

5.7 (0.22) 

2.6 (0.88) 

7.3 

(0.11) 

12.2 (0.50) 

5.8 (2.12) 


Table 1: The relative increase in MSE shown in percent (equation © scaled by a factor 
100), for the different tapered kriging predictors compared to the optimal kriging predictor for 
the exponential covariance function. The results are shown for the Wendland taper (W), the 
stationary hyperspherical taper T 2 {h), the non-stationary hyperspherical taper T 2 {s,t), and 
the two product tapers T^{s,t) and T^(s,t) for structured (S), random (R), and clustered (C) 
sampling scenarios. The values in the parentheses are the Monte Carlo standard deviation for 
each estimate. The bold values indicate the best method for each case. 


where Xg'^^(sj) is the tapered kriging prediction at location Sj for the simulated dataset j. 
These values are shown in Table for the exponential covariance function, and in Table for 
the Matern covariance function. 

There are several things to note in the tables. For the exponential covariance the non- 
stationary Hyperspherical taper is always better than the stationary tapers. The improvement 
is relatively small for structured observation locations, where is mainly comes from better 
predictions near the boundary of the observation domain, but the improvement is larger for 
spatially random locations and the largest for clustered observation locations. One should 
generally choose a tapering function which has the same differentiability as the true covariance 
function (see e.g. Furrer et al. 20061, and this is the reason for why the hyperspherical tapers 
outperform the Wendland taper and the product taper in the exponential case. Also for 
the Matern covariance, the adaptive tapers perform the best in general. In this case, however, 
the product taper outperforms the hyperspherical taper. The reason for this is that the 
hyperspherical tapers do not satisfy the tapering condition by Furrer et al. (20061 for this case, 
and they therefore do not have asymptotically equivalent MSEs to the optimal kriging predictor. 
However, it is interesting to note that the non-stationary hyperspherical taper outperforms the 
Wendland taper in the clustered situations, despite the fact that it does not satisfy the tapering 
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Short range 



Long range 




S 

R 

C 

S 

R 

C 

W{h) 

6.7 

(0.22) 

16.0 (0.58) 

23.7 (1.23) 

22.5 (0.90) 

60.5 (3.43) 

131.7 (8.25) 

T2{h) 

14.8 

(0.20) 

21.6 (0.50) 

23.7 (0.81) 

46.7 (0.60) 

62.4 (1.52) 

89.9 (4.28) 

T 2 {s,t) 

15.0 

(0.23) 

21.9 (0.56) 

14.7 (2.65) 

45.1 (0.56) 

59.9 (1.62) 

42.3 (6.59) 

T\s,t) 

20.0 

(0.26) 

26.2 (0.58) 

17.5 (2.96) 

61.4 (0.73) 

73.6 (1.81) 

52.1 (7.12) 

T\s,t) 

6.5 

(0.24) 

15.3 (0.82) 

6.3 (2.34) 

26.8 (1.18) 

64.1 (4.31) 

28.3 (10.61) 


Table 2: The relative increase in MSE shown in percent (eqnation Q scaled by a factor 100), 
for the different tapered kriging predictors compared to the optimal kriging predictor for the 
Matern covariance fnnction. The bold valnes indicate the best method for each case. See Table 
[^for explanations of the different test cases. 


Compnte 6{s) 

W{h) 

T2{h) 

T2{s,t) 228 

T\s,t) 228 

T^{s,t) 228 


Construct T Kriging 


445 

32 

741 

8 

797 

11 

642 

28 

817 

27 


Table 3: Average timings in milliseconds for the different steps in the simnlation stndy. The 
hrst colnmn shows the time for compnting the non-stationary taper range nsing Algorithm 1 
and interpolation. The second colnmn shows the time for constrncting the tapering matrix, 
and the third colnmn shows the time for compnting the kriging predictor nsing the tapered 
covariance matrix. 


condition. It is also interesting to note that is best overall for the Matern case, despite the 
fact that it is anisotropic. Thns, non-stationarity of the tapers seems to be important for the 
accnracy of the predictions. 

Table shows timing resnlts of the different methods in the simnlation stndy, implemented 
MATLA'B](2015 I and performed nsing a Macbook Pro compnter with a 2.6GHz Intel Core i7 


m 


processor (Apple Inc., Cnpertino, CA, USA). The average time for compnting 9{s) was 228ms, 
of which approximately 219ms was spent on rnnning Algorithm and the rest was nsed for 
interpolating the resnlt. However, the timing for this step is dependent on which measnrement 
scenario that is nsed, and the strnctnred scenario resnlted in the fastest average compntation 
time (128ms) while the clnstered was the slowest (376ms). The time for constrncting the 
covariance matrices, given the taper range, was very similar for the different methods, and 
comparable to the time it took to rnn Algorithm Also the time it took to compnte the 
kriging predictor was similar for the different methods, and the reason for this is that all 
tapering matrices had the same nnmber of non-zero elements. It shonld be stressed here that 
the timings for the hrst two steps are highly implementation dependent, and especially the time 
it takes to rnn Algorithm 1 conld be greatly rednced by implementing the procednre in C or 
Fortran. 


4 Tapering and parameter estimation 

In the previons section, tapering was nsed for spatial prediction based on measnrements of a 
Ganssian random held with known covariance fnnction. In practice the parameters, of the 
covariance fnnction often have to be estimated from the data hrst, and we now look at how 
tapering can be nsed in this step. As before, xq denotes a vector of observations of a mean-zero 
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Gaussian random field X(s). The log-likelihood, that should be maximized to estimate the 
parameters, is /(^;xo) = —ilog|Soo| ~ 5^0 computational cost for evaluating 


the likelihood is 0{N^) due to both the log-determinant and the matrix inverse. The natural 
tapering approximation of the likelihood is to replace Sqo with the tap ered covariance matri x 
Xloo- However, this can cause substantial bias in the resulting estimates (Kaufman et al. 2008), 


so a common choice is to instead use the tapered likelihood [(T'; xq) = — 5 log |lloo| ~ jXq (Sqq o 

T)xo. 

In the case of kriging estimation, adaptive tapers with taper ranges that adapted to the 
measurement locations was used to improve the predictions. For parameter estimation, however, 
it is not clear whether this is the best option. We therefore propose the following method for 
selecting tapering ranges in order to generate a taper matrix with M non-zero elements. The 
method, denoted a-adaptive tapering, can be used as a tradeoff between fully adaptive tapering 
and stationary tapering. 

Let = {nil, ■ ■ ■ be a vector where ruj is the number of non-zero elements for the 

ith row in T for a stationary taper with taper range chosen so that T has M non-zero elements 
in total, and select a G [0,1]. Then, compute the adaptive taper ranges by running Algorithm 
[^with m = (1 — a)ms -)- aM/N as the vector of desired elements per row. Using a = 0 results 
in a stationary taper whereas a = 1 results in the fully adaptive taper. Note that the method 
does not depend on the covariance matrix that is to be tapered, which means that the tapering 
matrix can be computed once before estimating the parameters. 


4.1 Numerical comparisons for parameter estimation 

In this section, adaptive tapering is tested in the case of parameter estimation for two expo¬ 
nential covariance models. Since exponential covariances are used, the hyperspherical taper 
T 2 is used in combination with the a-adaptive method for selecting the taper ranges. As a 
reference method, the observations are partitioned into blocks and the true log-likelihood is 
approximated by summing the log-likelihoods of the blocks. This can be implemented as a 
block-diagonal taper with elements Tij = 0 if s, 

otherwise. Because of this, Stein ( 2013[ ) refers to the method as block tapering. 

The first test case is estimation of a stationary exponential covariance model, ([^ with 
parameters n = 0.5, k = 10, and a = 10. The second test case is a non-stationary exponential 
covariance model. Specifically, the approach by Paciorek and Schervish (2006) is used to 


j and Sj are in the same block, and Tij = 0 


produce a non-stationary exponential covariance function 


C(s„s,-) = a2|S(s,)|V4|5](s.)|V4 


S(sj) -I- XI (sj) 


- 1/2 


exp 


( 10 ) 


Here is the variance, which is set to 4 in the comparison, and 

Q., = (s. - ( 5 MiAMy ‘ (s. - s,). 

For simplicity, we choose Xl(s) = k(s)I where k(s) is the spatially varying correlation range, 
modeled using a basis expansion log k(s) = log k(sx, Sy) = ki -|- K2Sx where ki = —6 and ^2 = 6 
are used. This results in a model where the covariance range is constant in the Sy direction but 
varies in the Sx direction. 

For both the stationary and non-stationary test cases, data is generated by sampling a mean- 
zero Gaussian process X(s) with the respective covariance function at N = 1024 locations on 
[0,1]^, chosen at random according to the spatially structured case used in Section 3.2 


The 

number of blocks for the block taper and the taper ranges for the a-adaptive method are chosen 
so that all taper matrices have approximately O.OIA^^ non-zero elements. 
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Figure 6: Tapering results for the stationary (left panel) and non-statioary (right panel) ex¬ 
ponential models. Square roots of the diagonal elements of the inverse Godabme information 
matrix for the a-adaptive taper as a function of a (solid lines), and for the block taper (dashed 
lines). Note that ordinary stationary tapering corresponds to the case a = 0. 


We use the diagonal of the inverse Godambe information matrix (see e.g. 
as a measure of the efficiency of the estimation methods based on different tapering func¬ 
tions. The elements correspond to the asymptotic marginal variances of the estimates ob¬ 
tained using the tapered likelihoods. The Godambe information for the tapered loglikelihood 
is G(T') = E^(H)E^(vv’r)-iE^(H), wliGrG V — BjIicI H — xq). TIig 

two expectations can be computed as E^(vv’^)y = pr (^(S- ^ o T)5]oo(S:j ^ o T)S:oo) and 

E^(H)i, = pr where = -Sqo and ^ ^ o T. 

Figure displays the square roots of the diagonal elements of the inverse Godambe infor¬ 
mation matrix, as functions of a for the a-adaptive method together with the corresponding 
values for the block taper for the two test cases. Recall that a = 0 corresponds to stationary 
tapering whereas a = 1 corresponds to fully adaptive tapering that minimizes the variation 
of non-zero elements between rows in the tapering matrix. The left panel of the figure shows 
the test case for the stationary exponential model, and one can note that the adaptive tapers 
outperform the stationary taper, and that an optimal value of a in this case is around 0.4. The 
right panel shows the non-stationary test case. The adaptive tapers outperform the stationary 
taper also in this test case, and the optimal value is in this case 1, meaning that the fully 
adaptive taper should be used. This is of little value, however, as the block taper is even better 
than the a-adaptive tapers for both cases. 

These are of course just two examples and different choices of true covariance function 
and measurement locations would have resulted in different optimal values for a. However, 
the general conclusion has been the same for most cases we have tried so far: Non-stationary 
tapering outperforms stationary tapering, but the block tapers outperform both. 

5 Discussion 

We have presented a class of computationally convenient compactly supported non-stationary 
covariance functions. Together with the proposed methods for selecting non-stationary taper 
ranges, these functions can be used for spatially adaptive covariance tapering. 

Numerical experiments showed that the adaptive tapering method can improve the perfor¬ 
mance of tapering for both kriging and parameter estimation, with negligible increase in com- 


Stein 2013 


15 





















putational cost. For parameter estimation, however, the numerical experiments also showed 
that dividing the data into blocks and ignoring the dependence between the blocks is often a 
better method than tapering. Thus, even if adaptive tapering is better than stationary taper¬ 
ing, we agree with Stein (20131 that tapering seems to have little practical value for parameter 
estimation. 

An alternative to using the adaptive tapers for parameter estimation in applications is to 
use them directly as non-stationary covariance models for the data. In this case, an interesting 
model extension is to allow for anisotropy in the covariance function by using elliptical kernels 
in the construction. With this straightforward extension, the model could be used as a compu¬ 
tationally efficient alternative to the non-stationary models by Paciorek and Schervish (2006). 
Another relevant extension is to derive the expressions for the differentiable tapers, obtained 
using kernels that are convolutions of the spherical kernels. This would further increase the 
value of this class of covariance models. Finally, another interesting direction of future research 


is to use the adaptive tapers in an adaptive MCMC method similar to the method by Wallin 


and Bolin (20151. 
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